General instructions for all assignments:
R Markdown file (named as: [AndrewID]-HW08.Rmd – e.g. “sventura-HW08.Rmd”) to the Homework 08 submission section on Blackboard. You do not need to upload the .html file.Organization, Themes, and HTML Output
(5 points)
warning = FALSE and message = FALSE in every code block.ggplot theme and use of color:ggplot() color scheme.color = "black").library(tidyverse)
library(data.table)
# Simple theme with white background, legend at the bottom
my_theme <- theme_bw() +
theme(axis.text = element_text(size = 12, color="indianred4"),
text = element_text(size = 14, face="bold", color="darkslategrey"))
# Colorblind-friendly color palette
my_colors <- c("#000000", "#56B4E9", "#E69F00", "#F0E442", "#009E73", "#0072B2",
"#D55E00", "#CC7947")
(3 points each)
Text Annotations on Graphs
Loading Data.
student <- read_csv("https://raw.githubusercontent.com/sventura/315-code-and-datasets/master/data/students.csv")
student$Grade <- factor(student$Grade)
student$AbsentDays <- factor(student$AbsentDays)
student_mds <- student %>%
dplyr::select(RaisedHands, VisitedResources,
AnnouncementsView, Discussion) %>%
scale() %>%
dist() %>%
cmdscale(k = 2) %>%
as.data.frame() %>%
mutate(Grade = factor(student$Grade,
levels = c("H","M","L"),
labels = c("High","Middle","Low")),
AbsentDays = student$AbsentDays,
StageID = factor(student$StageID,
levels = c("lowerlevel","MiddleSchool","HighSchool"),
labels = c("Elementary","Middle","High"))
)
colnames(student_mds)[1:2] <- c("mds_coordinate_1", "mds_coordinate_2")
basic_labs = labs(title = "First 2 MDS Coordinates",
x = "MDS First Coordinate",
y = "MDS Second Coordinate")
ggplot(data.frame(student_mds),aes(x = mds_coordinate_1,
y = mds_coordinate_2,
label = factor(AbsentDays,
levels = c("Above-7",
"Under-7"),
labels= c("A","U") )
)
) +
geom_text() + basic_labs + my_theme
ggplot(data.frame(student_mds),aes(x = mds_coordinate_1,
y = mds_coordinate_2,
label = factor(AbsentDays,
levels = c("Above-7",
"Under-7"),
labels= c("A","U") )
)
) +
geom_text(aes(label = factor(AbsentDays,
levels = c("Above-7","Under-7"),
labels= c("A","U")),
color = Grade),angle = 30)+
scale_color_manual(values = my_colors,
labels = levels(student_mds$Grade)) +
basic_labs +
labs(color = "Student Grade / Mark",
label = "Number of Days Absent") +
my_theme
ggplot(data.frame(student_mds),aes(x = mds_coordinate_1,
y = mds_coordinate_2,
label = factor(AbsentDays,
levels = c("Above-7",
"Under-7"),
labels= c("A","U") )
)
) +
geom_text(aes(color = Grade,
size = factor(StageID),
angle = 30)) +
basic_labs +
labs(color = "Student Grade / Mark",
label = "Number of Days Absent",
size = "School Level") +
scale_color_manual(values = my_colors) +
my_theme
5 variables were included in this model if we consider the MDS coordinates as only 2 variables. Since each is some combination of the 4 continuous variables we could also say 7. We include the first 2 MDS coordinates, what school level they are in (StageID), their grade/ mark in the class (Grade), and a binary variable if their absents were above 7 (AbsentDays).
Takeaway We can think about the 2 MDS components as some combination of the number of times the student raised their Hands (RaisedHands), times utilized student class resources (VisitedResources), number of times they checked new announcements for the class (AnnouncementsView), and number of times the student participated in discussion groups (Discussion). Although MDS is slightly different than PCA, and in both cases it’s not wise to over interpret the components, we see that there is strong clustering of students with similar class grades/marks .Especially in the first MDS component direction we see low performers tend to have smaller values of this coordinate than those average students and much lower than high performers. This might lead us to assume that this coordinate contains information that we associated with success or ability to do well in classes. Additionally first dimension also allows us to nicely cluster those with more than 7 absences and those with fewer, indicated by the text A and U. This structure allows us to see the positive correlation between number of absences and performance of students. These two MDS coordinates didn’t seem distinguish much between the School levels of the student, this is more obvious if you facet by the StageID.
(3 points each)
More Text Annotations on Graphs
Load the Olive Oil data from Lab 08.
olive <- fread("https://raw.githubusercontent.com/sventura/315-code-and-datasets/master/data/olive_oil.csv")
Create a bar chart of the region variable. Use the geom_text() function to add the counts in each bar above each bar in the bar chart. E.g., ggplot(...) + ... + geom_text(stat = "count", aes(y = ..count.., label = ..count..)). Adjust the vjust parameter in geom_text() in order to place the numbers in a more appropriate place. Use a single non-default color for the bars.
region.labels <- c("Northern Apulia", "Southern Apulia", "Calabria", "Sicily",
"Inland Sardinia", "Coastal Sardinia", "Eastern Liguria",
"Western Liguria", "Umbria")
area.labels <- c("South", "Sardinia", "Centre-North")
olive <- olive %>% mutate(region = as.factor(region), area = as.factor(area))
levels(olive$region) <- region.labels
levels(olive$area) <- area.labels
olive %>%
ggplot(aes(x = region)) +
geom_bar(fill = "#cc002b") +
ggtitle("Number of Olive Oils per Region") +
labs(x = "Region", y = "Count") +
geom_text(stat = "count", aes(y = ..count.., label = ..count.., vjust = -0.4)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Create a bar chart of the area variable. Use the geom_text() function to add the counts in each bar in the middle of each bar (i..e halfway up each bar) in the bar chart. Use a larger-than-default font size for the text. Use a single non-default color for the bars, and change the color of the text so that it contrasts well with the color of the bars.
olive %>%
ggplot(aes(x = area)) +
geom_bar(fill = "#cc002b") +
ggtitle("Number of Olive Oils per Area") +
labs(x = "Area", y = "Count") +
geom_text(stat = "count",
aes(y = ..count.. / 2, label = ..count..),
size = 10, colour = "#4d565b") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Repeat part (b), but use label = scales::percent((..count..)/sum(..count..))) to put percentages on the plot instead of the raw count scale. This is nice, since it allows you to quickly see both the percentages and the counts (height of the bars) in each category.
olive %>%
ggplot(aes(x = area)) +
geom_bar(fill = "#cc002b") +
ggtitle("Number of Olive Oils per Area") +
labs(x = "Area", y = "Count") +
geom_text(stat = "count",
aes(y = ..count.. / 2,
label = scales::percent((..count..)/sum(..count..))),
size = 10, colour = "#4d565b") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
(3 points each; 18 points total)
2D-KDEs with Contour Plots and Adjusted Bandwidths
ggplot(student_mds,aes(x = mds_coordinate_1,
y = mds_coordinate_2)) +
geom_density2d() + basic_labs + my_theme
ggplot(student_mds,aes(x = mds_coordinate_1,
y = mds_coordinate_2)) +
geom_point(aes(color = Grade,shape = AbsentDays),size = 5,alpha = .4) +
scale_shape_manual(values = c("Above-7" = "A", "Under-7" = "U"),
labels = c("Above 7","Under 7")) +
scale_color_manual(values = my_colors) +
geom_density2d() + basic_labs +
labs(shape = "Number of Days Absent",
color = "Student Grade / Mark") +
my_theme
ggplot(student_mds,aes(x = mds_coordinate_1,
y = mds_coordinate_2)) +
geom_point(aes(color = Grade,shape = AbsentDays),size = 5,alpha = .7) +
scale_shape_manual(values = c("Above-7" = "A", "Under-7" = "U"),
labels = c("Above 7","Under 7")) +
scale_color_manual(values = my_colors) +
geom_density2d(h=c(1/2,1/2),alpha = .7) + basic_labs +
labs(shape = "Number of Days Absent",
color = "Student Grade / Mark") +
my_theme
ggplot(student_mds,aes(x = mds_coordinate_1,
y = mds_coordinate_2)) +
geom_point(aes(color = Grade,shape = AbsentDays),size = 5,alpha = .7) +
scale_shape_manual(values = c("Above-7" = "A", "Under-7" = "U"),
labels = c("Above 7","Under 7")) +
scale_color_manual(values = my_colors) +
geom_density2d(h=c(3,3),alpha = .7) + basic_labs +
labs(shape = "Number of Days Absent",
color = "Student Grade / Mark") +
my_theme
I prefer the default bandwidth. The larger bandwidth returns an extremely smooth density estimate that doesn’t provide any useful information about the underlying empirical distribution in this problem. Specifically we are missing the peak/ mode associated with the heavy cluster of High preforming individuals with under 7 days of absences (black Us). Generally we see that the clustering of points near to the edges are less likely to given modes, which we want if we have dense group of points in those regions. In general, the peaks of the density estimates should correspond to the actual locations of the points (ideally in areas where there are lots of points), and the valleys in the density estimate should correspond to areas of the feature space where there are few observations. The large bandwidth prevents this from happening.
For the smaller bandwidth, we have too much fluctuation and it is hard to see any true distributional structure. The small bandwidth seems to overfit our data. (If you haven’t learned about overfitting yet, no big deal! You’ll learn this in 36-401 and 36-402.)
ggplot(student_mds,aes(x = mds_coordinate_1,
y = mds_coordinate_2)) +
geom_point(aes(color = Grade,shape = AbsentDays),size = 5,alpha = .7) +
scale_shape_manual(values = c("Above-7" = "A", "Under-7" = "U"),
labels = c("Above 7","Under 7")) +
scale_color_manual(values = my_colors) +
geom_density2d(h=c(1.25,1.5),alpha = .7) + basic_labs +
labs(subtitle = "bandwidth (1.5,1.25)",
shape = "Number of Days Absent",
color = "Student Grade / Mark") +
my_theme
Although I really default setting for this data set, I explored a slightly smaller first coordinate bandwidth as much the observable structure came for the first dimension. This seems to provide contours that made slightly more sense.
(10 points)
2D-KDEs with Heat Maps and Three-Color Gradients
(6 points)
ggplot(student_mds,aes(x = mds_coordinate_1,
y = mds_coordinate_2)) +
stat_density2d(aes(fill = ..density..),
h = c(1.25,1.5), geom = "tile", contour = F) +
scale_fill_gradient2(low = "black",mid = "white",high = "red",midpoint= .03) +
basic_labs + my_theme
(2 points)
I chose these colors to highly the areas of low density vs the areas of high density. Additionally, the white threshold gives a nice cutoff the emphasis the low density and high density regions. Additionally, this color contrast highlights the high density regions.
I chose the midpoint parameter to be 0.03. This value is smaller than middle distance between the highest value and lowest value, but it allows the observer to clearly see 4 modes on the red and to understand that there is some density mass around 0 of the first coordinate and between -1 and -2 of the second coordinate.
(2 points)
I prefer the contour plot over the heat map because it provides more details that we can use to interpret the data. The heat map is a more general display of areas of higher and lower density, but lacks extensive detail about the underlying empirical distribution. (Note: There’s no “right” answer here, in the sense that a contour plot isn’t always better than a heat map. They just represent the data in different ways.)
(Of course, you can always layer everything onto one plot – a heat map, a contour plot, the points themselves, a regression line, etc – if you think this adds value to the plot.)
Hierarchical Clustering and Dendrograms
There are several ways to create dendrograms in R. Regardless of which dendrogram package you use, you’ll first need to create the distance matrix corresponding to your dataset, and submit that distance matrix to hierarchical clustering.
dist_student <- student %>%
dplyr::select(RaisedHands, VisitedResources,
AnnouncementsView, Discussion) %>%
scale() %>%
dist()
hc_student_complete <- hclust(dist_student, method = "complete")
hc_student_complete$method
## [1] "complete"
The object hc_student_complete contains information on the merging of clusters at each iteration, the height of each link, the order of observations for plotting, the label of the observations, the function call, the clustering method, and the distance metric. hc_olive_complete$method contains the cluster method that has been used in constructing the hierarchical clusters. In this case it is set to the complete method.
plot(hc_student_complete, xlab = "Student",
ylab = "Cluster Merge Distance",
sub = "",
main = "Hierarchical Clustering of\n Students using Complete Linkage")
plot() uses a dendrogram to visualize the hierarchical clustering results.
The maximum distance at which two observations are linked can be see in the dendrogram by the highest horizontal bar. This occurs at a height of roughly 6.
Of the two groups linked in the final iterations of hierarchical clustering, both clusters seem to have about 50% of the observations.
labels_complete_2 <- cutree(hc_student_complete, k = 2)
labels_complete_2 is a vector of integers containing 480 elements. Note that this is the same as the number of students in the data set.
table(labels_complete_2) / nrow(student)
## labels_complete_2
## 1 2
## 0.5 0.5
Exactly 50% of the observations are in each of the clusters at the highest split.
library(forcats)
labels_complete_3 <- cutree(hc_student_complete, k = 3)
student_mds <- dist_student %>%
cmdscale(k = 2) %>%
as.data.frame() %>%
mutate(Grade = student$Grade,
AbsentDays = student$AbsentDays,
Labels = labels_complete_3)
colnames(student_mds)[1:2] <- c("mds_coordinate_1", "mds_coordinate_2")
student_mds$Grade <- student_mds$Grade %>%
fct_recode(Low = "L",
Medium = "M",
High = "H") %>%
fct_relevel("Low", "Medium", "High")
ggplot(data = student_mds,
aes(x = mds_coordinate_1, y = mds_coordinate_2,
color = Grade, label = Labels)) +
geom_text() +
labs(title = "2-D MDS Coordinates for Continuous Variables",
subtitle = "Student Academic Performace Dataset",
x = "MDS Coordinate 1",
y = "MDS Coordinate 2") +
my_theme
Most of the low grade students are in cluster 1, although the reverse is not true. The medium grade students seem to be split relatively evenly amongst all the clusters, while most of the high grade students are in clusters 2 and 3.
library(dendextend)
get_colors <- function(x, palette = my_colors) palette[match(x, unique(x))]
student %>%
dplyr::select(RaisedHands, VisitedResources,
AnnouncementsView, Discussion) %>%
scale() %>%
dist() %>%
hclust(method = "complete") %>%
as.dendrogram %>%
set("labels", student_mds$AbsentDays,
order_value = TRUE) %>%
set("labels_col", get_colors(student_mds$Labels),
order_value = TRUE) %>%
ggplot(horiz = T) +
my_theme +
labs(title = "Hierarchical Clusters of Students vs. Days Absent",
subtitle = "Student Academic Performace Dataset",
y = "Pairwise Euclidean Distance",
x = "") +
scale_x_continuous(breaks = NULL)
library(dendextend)
get_colors <- function(x, palette = my_colors) palette[match(x, unique(x))]
student %>%
dplyr::select(RaisedHands, VisitedResources,
AnnouncementsView, Discussion) %>%
scale() %>%
dist() %>%
hclust(method = "complete") %>%
as.dendrogram %>%
set("labels", student_mds$AbsentDays,
order_value = TRUE) %>%
set("labels_col", get_colors(student_mds$Labels),
order_value = TRUE) %>%
set("labels_cex", 0.5) %>%
set("branches_lwd", 0.5) %>%
ggplot(horiz = T) +
my_theme +
labs(title = "Hierarchical Clusters of Students vs. Days Absent",
subtitle = "Student Academic Performace Dataset",
y = "Height",
x = "") +
scale_x_continuous(breaks = NULL)
Hexagonal Bin Plots
(1 point each)
Look up at the documentation on hexagonal bin plots.
A Hex-bin plot is very similar to a Heat-map since the frequencies of both plots are expressed by a color coded gradient.
ggplot(data = student, aes(x = RaisedHands, y = AnnouncementsView)) +
geom_hex(bins = 20) +
scale_fill_gradient2("Density",
low = "blue",
mid = "yellow",
high = "red",
midpoint = 6) +
labs(title = "Joint Relationship between Raised Hands \nand Checked Announcements",
x = "Frequency of Raised Hands",
y = "Frequency of Checked Announcements")
The bins parameter specifies the number of hexagons across the y and x-axis in the graph. Note: there are 20 hexagons across the x-axis and the y-axis since I set bins=20. When the bins parameter is increased, the size of the bins becomes smaller and the number of bins increases. When the bins parameter is decreased, the size of the bins becomes larger and the number of bins decreases.
In our graph, the bottom left has the highest density of the joint distribution of RaisedHands and AnnouncementsView. Higher-density areas of the joint distribution should correspond with red-colored hexagons while lower-density areas of the joint distribution should correspond with blue-colored hexagons, according to our color gradient defined in part b.
The following plot shows the success rate of taking shots in the NHL from each position on the ice, binned by general shooting region. Note that the ‘home plate’ area in the middle shows a dramatically higher success probability; this is an area that many analysts describe as the ‘scoring chance’ area.
Hex-bin plots can be extended so that you can specify both the size of the hexagons (proportional to the frequency of observations in that area) and the color of hexagons (proportional to some other continuous variable for observations in that area). This idea was championed by Kirk Goldsberry, a cartographer who now works applies statistical analysis and data visualization to basketball. These are awesome plots.
Unfortunately, there is no ggplot() implementation for this already. If anyone wants to design one for course credit or academic credit, please contact Sam! (Note: This will not be easy.)
Hex-bin plots are nice when you have a very large number of observations. For smaller datasets like the student dataset, they are typically not as useful as contour plots and heat maps, since they require a small number of large bins to get a good picture of the density of observations.
(2 points each)
Criticize the Lab 08 Oral Evaluation Graphic
On Lab 08, the Oral Evaluation Graphic was something called a “Spider Chart” or “Radar Chart” or “Star Plot”. You can read more about this type of chart here.
Critique the chart from the Lab 08 Oral Evaluation:
We can visualize many continuous variables at once since the circular arrangement allows us to put more information in less space. It is also very easy to compare small number of variables since the variables would be more spread out along its individual axis.
Having multiple polygons in one Radar Chart makes it hard to read, confusing and too cluttered. Especially if the polygons are filled in, as the top polygon covers all the other polygons underneath it. Also, having too many variables creates too many axes and can also make the chart hard to read and complicated. So it’s good practice to keep Radar Charts simple and limit the number of variables used.
All axes are arranged radially, with equal distances between each other, while maintaining the same scale between all axes. Therefore, it is unclear to discern the actual quantity of each variable and compare variables with each other.
Arranging the variables in a different order may affect the size/area of the radar chart. Since each variable value is plotted along its individual axis and all the variables in a dataset and connected together to form a polygon, connecting two larger quantities would produce a much larger area than connecting one large quanitity and one smaller quanitity.Therefore if we re-arrange the variables of a radar chart to assign large quantities together, we would get a chart with larger area, even though they essentially show the same data.
(BONUS: 10 points)
Beyond Default Pairs Plots
library(GGally)
student_sub <- student %>%
dplyr::select(Grade, AbsentDays, RaisedHands,
VisitedResources, AnnouncementsView) %>%
mutate(Grade = factor(Grade), AbsentDays = factor(AbsentDays))
student_sub %>%
ggpairs(upper = list(continuous = "density",
discrete = "ratio",
combo = "facetdensity"),
title = "Variables in Student Data")